Applied Spatial Linking

GESIS Workshop: Introduction to Geospatial Techniques for Social Scientists in R

Stefan Jünger & Dennis Abel

2025-04-10

Now

Day Time Title
April 09 10:00-11:30 Introduction
April 09 11:30-11:45 Coffee Break
April 09 11:45-13:00 Data Formats
April 09 13:00-14:00 Lunch Break
April 09 14:00-15:30 Mapping I
April 09 15:30-15:45 Coffee Break
April 09 15:45-17:00 Spatial Wrangling
April 10 09:00-10:30 Mapping II
April 10 10:30-10:45 Coffee Break
April 10 10:45-12:00 Applied Spatial Linking
April 10 12:00-13:00 Lunch Break
April 10 13:00-14:30 Spatial Autocorrelation
April 10 14:30-14:45 Coffee Break
April 10 14:45-16:00 Spatial Econometrics & Outlook

What are georeferenced data?

Data with a direct spatial reference \(\rightarrow\) geo-coordinates

  • Information about geometries
  • Optional: Content in relation to the geometries

Sources: OpenStreetMap / GEOFABRIK (2018), City of Cologne (2014), and the Statistical Offices of the Federation and the Länder (2016) / Jünger, 2019

Georeferenced survey data

Survey data enriched with geo-coordinates (or other direct spatial references).

With georeferenced survey data, we can analyze interactions between individual behaviors and attitudes and the environment.

An example workflow

:::: ::: {.column width="50%"} From addresses to analyses with georeferenced survey data, there are several steps and challenges along the way. We will talk about:

  • Data Protection & Data Access
  • Geocoding
  • Spatial Data Linking
  • Applied Examples :::

::::

Data protection

That's one of the biggest issues.

  • Explicit spatial references increase the risk of re-identifying anonymized survey respondents
  • Re-identification can occur during data processing but also during analysis

Affects all phases of research and data management!

Data availability

Geospatial Data

  • Often distributed in a decentralized way
  • Fragmented data landscape, at least in Germany

Georeferenced Survey Data

  • At their core, ordinary survey data
  • Usability depends on documentation
  • Access difficult due to data protection restrictions

  • https://www.eea.europa.eu/data-and-maps
  • https://datasearch.gesis.org/
  • https://datasetsearch.research.google.com/

::::

Distribution & re-identification risk

Even without (in)direct spatial references, data may still be sensitive.

  • Geospatial attributes add new information to existing data
  • May be part of general data privacy checks, but we may not distribute these data as they are

:::: ::: {.column width="50%"} Safe Rooms / Secure Data Centers

  • Control access
  • Check output :::

https://www.gesis.org/en/services/processing-and-analyzing-data/guest-research-stays/secure-data-center-sdc

::::

Geocoding

Geocoding is the conversion of indirect spatial references (e.g., addresses) into direct spatial references (e.g., coordinates).

However, conducting this procedure is tricky (not only in R). Many services are either

  • Expensive (they cost money or come with other restrictions)
  • Probably not data protection-friendly (Hey Google)
  • Or both

OSM Is Your Friend

We can use the Nominatim API from OSM to geocode at least a few addresses.

library(tibble)
library(tidygeocoder)

leibniz_addresses <-
  tibble::tribble(
    ~street, ~housenumber, ~zip_code, ~place, ~institute,
    "B 2", "1", "68159", "Mannheim", "GESIS",
    "Unter Sachsenhausen", "6-8",  "50667", "Köln", "GESIS",
    "Kellnerweg", "4", "37077", "Göttingen", "DPZ",
    "Reichsstr.", "4-6", "04109",  "Leipzig", "GWZO",
    "Schöneckstraße", "6", "79104", "Freiburg", "KIS",
    "Albert-Einstein-Straße", "29a", "18059", "Rostock", "LIKAT",
    "L7", "1", "68161", "Mannheim", "ZEW",
    "Müggelseedamm", "310", "12587", "Berlin", "IGB",
    "Campus D2", "2", "66123", "Saarbrücken", "INM",
    "Eberswalder Straße", "84", "15374", "Müncheberg (Mark)", "ZALF"
  ) |> 
  dplyr::mutate(whole_address = paste(street, housenumber, zip_code, place))

Run the Geocoding

leibniz_addresses <-
  tidygeocoder::geocode(
    leibniz_addresses,
    address = whole_address
  )

leibniz_addresses
# A tibble: 10 × 8
   street         housenumber zip_code place institute whole_address   lat  long
   <chr>          <chr>       <chr>    <chr> <chr>     <chr>         <dbl> <dbl>
 1 B 2            1           68159    Mann… GESIS     B 2 1 68159 …  49.5  8.45
 2 Unter Sachsen… 6-8         50667    Köln  GESIS     Unter Sachse…  50.9  6.95
 3 Kellnerweg     4           37077    Gött… DPZ       Kellnerweg 4…  51.6  9.95
 4 Reichsstr.     4-6         04109    Leip… GWZO      Reichsstr. 4…  51.3 12.4 
 5 Schöneckstraße 6           79104    Frei… KIS       Schöneckstra…  48.0  7.86
 6 Albert-Einste… 29a         18059    Rost… LIKAT     Albert-Einst…  54.1 12.1 
 7 L7             1           68161    Mann… ZEW       L7 1 68161 M…  49.5  8.47
 8 Müggelseedamm  310         12587    Berl… IGB       Müggelseedam…  52.4 13.6 
 9 Campus D2      2           66123    Saar… INM       Campus D2 2 …  49.3  7.04
10 Eberswalder S… 84          15374    Münc… ZALF      Eberswalder …  NA   NA   

Convert To sf Object And Plot

leibniz_addresses_sf <-
  leibniz_addresses |> 
  dplyr::filter(!is.na(lat)) |> 
  sf::st_as_sf(coords = c("long", "lat"), crs = 4326)

library(tmap)

tmaptools::read_osm(
  leibniz_addresses_sf, 
  type = "esri-topo"
) |> 
  terra::rast() |> 
  tm_shape() +
  tm_rgb() +
  tm_shape(leibniz_addresses_sf) +
  tm_dots(size = 2, col = "red")

Our Approach

We rely on a service offered by the Federal Agency of Cartography and Geodesy (BKG):

  • Online interface and API for online geocoding
  • Offline geocoding possible based on raw data
  • But: Data and service are restricted

bkggeocoder

:::: ::: {.column width="50%"} R package bkggeocoder for (offline) geocoding, developed at GESIS by Stefan and Jonas Lieth:

::::

Spatial Linking

:::: ::: {.column width="50%"} The geocoding tool automatically retrieves point coordinates, administrative unit keys, and grid cell IDs. Spatial joins based on the coordinates provide other units:

  • Constituencies
  • Administrative units across time (e.g., harmonized territorial status) :::

Sources: OpenStreetMap / GEOFABRIK (2018), City of Cologne (2014), Leibniz Institute of Ecological Urban and Regional Development (2018), Statistical Offices of the Federation and the Länder (2016), and German Environmental Agency / EIONET Central Data Repository (2016) / Jünger, 2019

::::

Data Linking

Linking via common identifiers is the most commonly used approach, but it comes with challenges (e.g., changes in territorial status and land reforms, comparability of units, heterogeneity within units).
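The generic pattern can be sketched with made-up tables keyed by an AGS-style district identifier (all values here are illustrative, not real figures):

```r
library(dplyr)

# Made-up example tables sharing the district key 'AGS'
geometries <- data.frame(
  AGS = c("05315", "09162"),
  name = c("Köln", "München")
)
attributes <- data.frame(
  AGS = c("05315", "09162"),
  population = c(1084000, 1488000)
)

# Linking via the common identifier; mismatched keys
# (e.g., after a land reform) would silently produce NAs
dplyr::left_join(geometries, attributes, by = "AGS")
```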

Spatial linking methods (examples) I

1:1

(sf::st_join())

Distances

(sf::st_distance())

Sources: German Environmental Agency / EIONET Central Data Repository (2016) and OpenStreetMap / GEOFABRIK (2018) / Jünger, 2019
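Both methods can be sketched with toy geometries; the points, the polygon, and their coordinates below are made up for illustration:

```r
library(sf)

# Toy points and a toy polygon in a projected CRS (EPSG:3035)
pts <- sf::st_as_sf(
  data.frame(id = 1:2, x = c(10, 30), y = c(10, 30)),
  coords = c("x", "y"), crs = 3035
)
poly <- sf::st_sf(
  name = "A",
  geometry = sf::st_sfc(
    sf::st_polygon(list(rbind(c(0, 0), c(20, 0), c(20, 20), c(0, 20), c(0, 0)))),
    crs = 3035
  )
)

# 1:1 linking: each point inherits the attributes of the polygon
# it falls into (NA for points outside all polygons)
sf::st_join(pts, poly)

# Distances: a matrix of point-to-polygon distances in CRS units (meters)
sf::st_distance(pts, poly)
```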

Spatial linking methods (examples) II

Filter methods

(sf::st_filter() or terra::vect(..., filter = ...))

Buffer zones

(sf::st_buffer())

Sources: Leibniz Institute of Ecological Urban and Regional Development (2018) and Statistical Offices of the Federation and the Länder (2016) / Jünger, 2019
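These two methods can be sketched the same way, again with made-up toy geometries:

```r
library(sf)

# Three toy points and one toy polygon (made up for illustration)
pts <- sf::st_as_sf(
  data.frame(id = 1:3, x = c(5, 15, 40), y = c(5, 15, 40)),
  coords = c("x", "y"), crs = 3035
)
poly <- sf::st_sf(
  name = "A",
  geometry = sf::st_sfc(
    sf::st_polygon(list(rbind(c(0, 0), c(20, 0), c(20, 20), c(0, 20), c(0, 0)))),
    crs = 3035
  )
)

# Filter: keep only the points that intersect the polygon
sf::st_filter(pts, poly)

# Buffer: 10 m zones around the points, e.g., as neighborhood
# definitions for subsequent joins
sf::st_buffer(pts, dist = 10)
```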

Cheatsheet: Spatial Operations

An overview of spatial operations using the sf package can be accessed here.

Data Aggregation

If you want to aggregate attributes and geometries of a shapefile, you can rely on st_combine(x), st_union(x, y), and st_intersection(x, y) to combine shapefiles, dissolve borders, and return the intersection of two shapefiles.
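The difference between combining and dissolving is easiest to see with two made-up adjacent squares:

```r
library(sf)

# Two adjacent toy squares (coordinates made up for illustration)
a <- sf::st_polygon(list(rbind(c(0, 0), c(1, 0), c(1, 1), c(0, 1), c(0, 0))))
b <- sf::st_polygon(list(rbind(c(1, 0), c(2, 0), c(2, 1), c(1, 1), c(1, 0))))
squares <- sf::st_sfc(a, b)

# st_combine(): one MULTIPOLYGON, the interior border is kept
sf::st_combine(squares)

# st_union(): one POLYGON, the shared border is dissolved
sf::st_union(squares)
```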

For raster data, you can aggregate with the function terra::aggregate() (if you have matching raster grids). To deal with spatial misalignment, combine it with terra::resample() (if your raster grids don't match).
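A minimal sketch with two artificial rasters, one at a fine resolution and one on a misaligned grid (all values are made up):

```r
library(terra)

# A fine-resolution toy raster with 10 x 10 cells
fine <- terra::rast(nrows = 10, ncols = 10, vals = 1:100)

# Aggregate: merge 2 x 2 blocks of cells into one, here by their mean
coarse <- terra::aggregate(fine, fact = 2, fun = mean)

# A second raster on a different, misaligned grid (7 x 7 cells)
other <- terra::rast(nrows = 7, ncols = 7, vals = runif(49))

# Resample: transfer the misaligned raster onto the grid of 'fine'
aligned <- terra::resample(other, fine, method = "bilinear")
```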

Data aggregation

german_districts <-
  sf::read_sf("./data/VG250_KRS.shp") |> 
  dplyr::mutate(
    federal_state =
      as.numeric(stringr::str_sub(AGS, 1, 2))
  ) 

german_states <-
  german_districts |>  
  dplyr::group_by(federal_state) |>  
  dplyr::summarize(
    geometry = sf::st_union(geometry)
  )

tm_shape(german_states) + 
  tm_borders()

Fake research question

Say we’re interested in the impact of neighborhood characteristics (e.g., mobility infrastructure) on individual-level attitudes towards energy transition.

We plan to conduct a survey which is representative of the population of Germany.

https://imgflip.com/i/9ptcuu

Synthetic georeferenced survey data

We have added a synthetic dataset of 2,000 geocoordinates in the ./data/ folder (aggregated to 1 sq km centroids). Initially, it was based on a sample of the georeferenced GESIS Panel.

synthetic_survey_coordinates <-
  readRDS("./data/synthetic_survey_coordinates.rds")

tmaptools::read_osm(
  synthetic_survey_coordinates, 
  type = "esri-topo"
) |> 
  terra::rast() |> 
  tm_shape() +
  tm_rgb() +
  tm_shape(synthetic_survey_coordinates) +
  tm_dots(size = 2, col = "red")

Correspondence table

As in any survey that deals with addresses, we need a correspondence table of the distinct identifiers.

correspondence_table <-
  dplyr::bind_cols(
    id_survey = 
      stringi::stri_rand_strings(10000, 10) |>  
      sample(2000, replace = FALSE),
    id = synthetic_survey_coordinates$id
  )

correspondence_table
# A tibble: 2,000 × 2
   id_survey     id
   <chr>      <int>
 1 8SQJfXMMVS     1
 2 kQEnbZoiqZ     2
 3 kgfR5rWkrE     3
 4 yUTKHhxIi8     4
 5 wdRO2lHx03     5
 6 BzhXilhllD     6
 7 YRsG4UDnaL     7
 8 ghuc6hjN7K     8
 9 2E58Cre0Pr     9
10 4bxhYwJiRE    10
# ℹ 1,990 more rows

Conduct the survey

We ‘ask’ respondents for some standard sociodemographics. But we also include an item from the GLES Panel on energy transformation: “From 2030, no more new cars with petrol or diesel engines are to be registered in Germany. How much do you agree?” (entrans). Since we cannot share the actual data, we created fake data using the faux package.

fake_survey_data <- 
  dplyr::bind_cols(
    id = correspondence_table$id,
    age = sample(18:100, 2000, replace = TRUE),
    gender = 
      sample(1:3, 2000, replace = TRUE, prob = c(.45, .45, .1)) |> 
      as.factor(),
    education =
      sample(1:4, 2000, replace = TRUE) |>  
      as.factor(),
    income =
      sample(100:10000, 2000, replace = TRUE),
    entrans = secret_variable_we_are_hiding_from_you
  )

What could explain our outcome (entrans)?

Access to charging infrastructure

Better access to charging infrastructure means higher support for energy transformation.

Rural-urban divide

Higher population density means higher support for energy transformation.

District-level data

We already have most of our information on the district level from yesterday.

district_attributes <-
  # load district shapefile
  sf::read_sf("./data/VG250_KRS.shp") |> 
  # add attribute table
  dplyr::left_join(
    readr::read_delim("./data/attributes_districts.csv", delim = ";"), 
    by = "AGS"
  ) 

District operationalization

Access to charging infrastructure

Charging stations per 1000 inhabitants in a district

Rural-urban divide

Population Density in a district

Access to charging infrastructure

Luckily, we did something similar yesterday!

charger_data <- 
  # Load charging station points 
  readr::read_delim("./data/charging_points_ger.csv", delim = ";") |> 
  # Filter out rows with missing longitude or latitude
  dplyr::filter(!is.na(longitude) & !is.na(latitude)) |> 
  # Convert data frame to sf object
  sf::st_as_sf(coords = c("longitude", "latitude"), crs = 4326) |> 
  # Reproject the spatial data to the desired CRS (Coordinate Reference System)
  sf::st_transform(crs = sf::st_crs(district_attributes))

aggregated_charger_data <-
  charger_data |> 
  # spatial join district ids
  sf::st_join(district_attributes |>  dplyr::select(AGS)) |>  
  # Group by district ID
  dplyr::group_by(AGS) |> 
  # Summarize the number of chargers in each district
  dplyr::summarise(charger_count = dplyr::n()) |> 
  # Drop geometry column
  sf::st_drop_geometry()

district_attributes <-
  # Left join with sampling area attributes
  dplyr::left_join(
    district_attributes, aggregated_charger_data, by = "AGS"
  ) |> 
  # Calculate charger density per 1000 population
  dplyr::mutate(charger_dens = (charger_count * 1000) / population)

Rural-urban divide

Our attribute table contains the number of inhabitants per district but not the population density. Therefore, we need to calculate the area of the district.

# calculate area of districts
# areas will always be calculated
# in units according to the CRS 
sf::st_area(district_attributes) |> 
  head(4)
Units: [m^2]
[1]  49247856 112246123 211583605  71821681
district_attributes |> 
  sf::st_transform(4326) |>  
  sf::st_area() |> 
  head(4)
Units: [m^2]
[1]  49066201 111830797 210783132  71562061

Population density

All left to do is a simple mutation:

# calculating population density
district_attributes <-
  district_attributes |>  
  # calculate area of districts (areas will always
  # be calculated in units according to the CRS )
  dplyr::mutate(
    area = sf::st_area(district_attributes)
  ) |> 
  # change unit to square kilometers
  dplyr::mutate(
    area_km2 = units::set_units(area, km^2)
  ) |> 
  # recode variable as numeric
  dplyr::mutate(
    area_km2 = as.numeric(area_km2)
  ) |> 
  # calculate population density
  dplyr::mutate(
    pop_dens = population / area_km2
  )

tm_shape(district_attributes) +
  tm_fill(
    "pop_dens", 
    fill.scale = 
      tm_scale(
        breaks = c(0,100,200,400,800,1600,3200, Inf)
      )
  ) +
  tm_layout(legend.outside = TRUE)

Respondents in districts

We have population density on the district level. Since our analysis focuses on the individual level, we can spatially join the information to our fake respondents’ coordinates.

district_linked_df <-
  district_attributes |> 
  sf::st_transform(sf::st_crs(synthetic_survey_coordinates)) |> 
  # keeping just the variables we want
  dplyr::select(charger_dens, publictransport_meandist, pop_dens) |>  
  # since we want to join district to
  # respondent defining coordinates first
  sf::st_join(
    x = synthetic_survey_coordinates,
    # district data second
    y = _,
    # some points may lie on the border
    # choosing intersects 
    join = sf::st_intersects
  ) |>  
  # drop our coordinates for data protection
  sf::st_drop_geometry()

Respondents in districts

head(district_linked_df, 5)
# A tibble: 5 × 4
     id charger_dens publictransport_meandist pop_dens
  <int>        <dbl>                    <dbl>    <dbl>
1     1        0.619                     9628     38.2
2     2        0.573                     9152     80.9
3     3        0.768                     3687    134. 
4     4        0.573                     9152     80.9
5     5        0.846                     2155    845. 

Too boring? Let’s scale it down!

We have our nice fake coordinates and know we also have variations in some districts (e.g., Cologne) concerning e-car mobility. Let’s try to operationalize the variables on a smaller level of aggregation.

Access to charging infrastructure

Charging stations within a 5000 m buffer

Rural-urban divide

Population within a 5000 m buffer

Charging stations in 5000m buffer

The procedure for calculating the number of chargers in a 5km buffer is very similar to calculating the chargers in a district.

# Create 5000m buffers around the fake coordinates
buffers <- 
  synthetic_survey_coordinates |> 
  sf::st_buffer(dist = 5000)

# Perform intersection between buffers and points_sf
inter <- 
  sf::st_intersects(buffers, charger_data |> sf::st_transform(3035))

# Count points within each buffer
coordinate_linked_df <- 
  synthetic_survey_coordinates |> 
  dplyr::mutate(num_charger = lengths(inter))

Distance calculation I

sf::st_distance() would calculate distances between all respondents and all charging stations, resulting in a huge matrix. We can make our lives easier by first identifying the nearest station and then calculating only that distance.

# Find the nearest charging station 
nearest_charger <- 
  sf::st_nearest_feature(
    synthetic_survey_coordinates, 
    charger_data |> 
      sf::st_transform(3035)
  )

# Calculate the distance between each point in
# fake_coordinates & its nearest charging station
coordinate_linked_df <- 
  coordinate_linked_df |> 
  dplyr::mutate(
    charger_distances =
      sf::st_distance(
        synthetic_survey_coordinates, 
        charger_data[nearest_charger,] |> 
          sf::st_transform(3035), 
        by_element = TRUE
      ) |>
      as.vector()
  )

Distance calculation II

# add a column for the distances
coordinate_linked_df <- 
  coordinate_linked_df |> 
  dplyr::mutate(
    # Calculate distances in kilometers 
    charg_dist_km = as.numeric(charger_distances) / 1000
  )

summary(coordinate_linked_df$charg_dist_km)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
 0.01544  0.36216  0.72309  1.39690  1.80339 13.65649 

Population buffers

…and we’re not yet done: we still need the population in the neighborhood. Let’s calculate buffers of 5000 meters and add the population mean values to our dataset.

# download data & extract information
inhabitants <- z11::z11_get_1km_attribute(Einwohner)

# spatially link "on the fly"
population_buffers <- 
  terra::extract(
    inhabitants, 
    synthetic_survey_coordinates |>  
      sf::st_buffer(5000), 
    fun = mean,
    na.rm = TRUE,
    ID = FALSE,
    raw = TRUE
  ) |> 
  unlist()

# link with data 
coordinate_linked_df <-
  coordinate_linked_df |> 
  dplyr::mutate(population_buffer = population_buffers)

Join with Survey

We hope you’re not tired of joining data tables. Since we care a bit more about data protection, we have yet another joining task: to join the information we received using our (protected) fake coordinates to the actual survey data via the correspondence table.

# last joins for now
fake_survey_data_spatial <-
  # first join the id
  dplyr::left_join(correspondence_table, district_linked_df, by = "id") |> 
  dplyr::left_join(coordinate_linked_df,  by = "id") |> 
  # join the survey data
  dplyr::left_join(fake_survey_data, by = "id") |> 
  dplyr::select(-id)

Correlation Analysis

fake_survey_data_spatial |> 
  dplyr::select(
    entrans, 
    pop_dens, 
    charger_dens, 
    publictransport_meandist,
    charg_dist_km, 
    num_charger, 
    population_buffer
  ) |>  
  corrr::correlate() |> 
  corrr::network_plot(min_cor = .1)

Exercise 6: Spatial Joins

Exercise